Empirical risk estimation system, empirical risk estimation method, and empirical risk estimation program

ABSTRACT

A density estimation unit  81  is given observed covariates and estimates a conditional probability density of a random variable, denoting the real value that is the result of a smooth function map of the unobserved covariates, by training a regression model with the response corresponding to the random variable and the regressors corresponding to the observed covariates. An integral estimation unit  82  estimates the one-dimensional integral of the product of a sigmoidal function with the input random variable and the conditional probability density function of the random variable.

TECHNICAL FIELD

The present invention relates to an empirical risk estimation system, an empirical risk estimation method, and an empirical risk estimation program which estimate the expected misclassification costs of a classifier when one or more unknown covariates are acquired.

BACKGROUND ART

In many situations, classification accuracy can be improved by collecting more covariates. However, acquiring some of the covariates might incur costs. As an example, consider the diagnosis of a patient as either having diabetes or not. Collecting information (covariates) like age and gender is almost free of cost, whereas taking blood measurements clearly involves costs (e.g. the working-hour cost of a medical doctor). On the other hand, there is also a cost of wrongly classifying the patient as having no diabetes, although the patient is suffering from diabetes.

Therefore, it can be argued that the final goal in classification is to reduce the total cost of misclassification, which is given by the sum of the acquired covariates' costs and the expected misclassification costs.

In general, it is assumed that the costs of acquiring a covariate and the costs of misclassification are given. In order to reduce the total cost of misclassification, it is necessary to estimate the expected misclassification costs if we were given more covariates (i.e., in the above example, more information about the patient).

Formally, this expected cost can be expressed as

$$\lbrack \text{Math. } 1 \rbrack \quad \mathbb{E}_{x_A}\!\left[ \mathrm{BayesRisk}(x_{A \cup S}) \mid x_S \right] = \mathbb{E}_{x_A}\!\left[ \mathbb{E}_{y}\!\left[ c_{y,\delta^*(x_{A \cup S})} \mid x_{A \cup S} \right] \mid x_S \right] = \int \sum_{y} c_{y,\delta^*(x_{A \cup S})}\, p(y, x_A \mid x_S)\, dx_A, \quad (\text{Equation 1})$$

where S denotes the set of already observed covariates, and A denotes the covariates which we consider to acquire additionally. The cost of classifying a sample (i.e., in the above example, the patient) as class y′, although the correct class is y, is denoted as c_(y,y′). In the following explanation, when using a Greek letter in the text, an English notation of the Greek letter may be enclosed in brackets ([ ]). In addition, when representing an upper-case Greek letter, the beginning of the word in [ ] is indicated by a capital letter, and when representing a lower-case Greek letter, the beginning of the word in [ ] is indicated by a lower-case letter. Moreover, note that in the following description, the Greek letter delta is written as d, and the union symbol of mathematics is written as U in the specification. Furthermore, d*(x_(A U S)) denotes the Bayes classifier that uses the covariates A U S and is defined as

$$\lbrack \text{Math. } 2 \rbrack \quad \delta^*(x_{A \cup S}) = \underset{y^* \in \{0,1\}}{\operatorname{argmin}} \sum_{y \in \{0,1\}} p(y \mid x_{A \cup S}) \cdot c_{y,y^*}, \quad (\text{Equation 2})$$

where c_(y,y*) is zero if y equals y*; otherwise, c_(y,y*)>0 specifies the cost of wrongly classifying a sample with true label y as label y*.
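As an illustration of this decision rule (a minimal Python sketch, not part of NPL 1 or the claimed invention; the function name is hypothetical):

```python
def bayes_decision(p_y1, c01, c10):
    """Bayes classifier of Equation 2 for binary labels.

    p_y1: posterior probability p(y=1 | x_{A U S})
    c01:  cost c_{0,1} of predicting 1 when the true label is 0
    c10:  cost c_{1,0} of predicting 0 when the true label is 1
    """
    # Expected cost of predicting 0 is p(y=1)*c10; of predicting 1 is p(y=0)*c01.
    return 1 if p_y1 * c10 >= (1.0 - p_y1) * c01 else 0
```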

In the following, we also call the unknown covariates A the potential query covariates, or simply query covariates, since these are the covariates that we might want to query (e.g. by conducting clinical experiments) and then include their outcome x_(A) into the classifier.

As can be seen from Equation 1, the calculation of the expected misclassification costs requires integration over all unknown covariates A. If there is more than one unknown covariate, i.e., |A|>1, then the evaluation of this integral is computationally challenging, since there is no analytic closed-form solution.

NPL 1 describes a Bayesian cost-sensitive classification method. The method described in NPL 1 always limits |A| to one, and thus only a one-dimensional integral needs to be solved.

Note that NPL 2 describes a learning method using labeled data with gradient descent.

CITATION LIST Non Patent Literature

[NPL 1]

Shihao Ji, Lawrence Carin, “Cost-sensitive feature acquisition and classification”, Pattern Recognition, Volume 40, Issue 5, May 2007, pp. 1474-1485.

[NPL 2]

Hastie, Trevor, Tibshirani, Robert, Friedman, Jerome, “The Elements of Statistical Learning”, Springer-Verlag New York, 2009.

SUMMARY OF INVENTION Technical Problem

As described above, the method described in NPL 1 cannot estimate the expected misclassification costs when there is more than one query covariate. This can lead to the sub-optimal decision of stopping to query covariates, even though the total costs of misclassification could be decreased further.

In the following, we give a detailed example, which shows that this is a problem even if the data is linearly separable. Let us denote by V the set of all possible covariates; S denotes the set of already observed covariates, and A denotes the covariates which we consider to acquire additionally. Let us define the total expected costs when acquiring covariates

A ⊆ (V ∖ S) as

$$\lbrack \text{Math. } 3 \rbrack \quad t(A) := \mathbb{E}_{x_A}\!\left[ \mathrm{BayesRisk}(x_{A \cup S}) \mid x_S \right] + \sum_{i \in A} f_i,$$

where f_(i) is the cost of acquiring covariate i. The method described in NPL 1 also tries to optimize t(A), but uses a greedy approach that selects the set A for which t(A) is minimal and |A|<=1. The algorithm stops if A=∅ is selected. The following example shows that a method considering only |A|<=1 can fail.

Let us consider the situation where

V ∖ S = {x₁, x₂}   [Math. 4]

and the conditional joint distribution of x₁ and x₂ is an isotropic Gaussian with zero mean: p(x₁, x₂|x_(S)) = N(x₁, x₂|0, I).

For simplicity, we assume the misclassification costs are c_(0,1)=c_(1,0)=c>0, and c_(y,y)=0. Furthermore, again for simplicity, the cost of querying covariate x₁ is assumed to be the same as that of x₂, which we denote by f>0.

Let us assume the following decision boundary between class 1 and class 0:

class 1 ⇔ x₂ ≥ mx₁ + r,   [Math. 5]

where, without loss of generality, we assume m>0 and r>0, as illustrated in FIG. 7. FIG. 7 depicts an explanatory diagram illustrating an example of a decision boundary between classes. Furthermore, in FIG. 7, the contour plots of constant density of the conditional joint probability p(x₁, x₂|x_(S)) are shown. We consider the four cases A=∅, A={x₁}, A={x₂}, and A={x₁, x₂}. For each A we calculate the expected misclassification costs, which we denote as [alpha]_(A).

First, let A={x₁, x₂}, then

$$\lbrack \text{Math. } 6 \rbrack \quad \delta^*(x_{A \cup S}) = \begin{cases} 1 & \text{if } x_2 \geq m x_1 + r, \\ 0 & \text{else,} \end{cases}$$

and

$$\lbrack \text{Math. } 7 \rbrack \quad \begin{aligned} \alpha_{\{x_1,x_2\}} :={}& \mathbb{E}_{x_A}\!\left[ \mathrm{BayesRisk}(x_{A \cup S}) \mid x_S \right] \\ ={}& \int \Big( \sum_y c_{y,\delta^*(x_{A \cup S})}\, p(y \mid x_A, x_S) \Big) p(x_A \mid x_S)\, 1_{x_2 \geq m x_1 + r}(x_A)\, dx_A \\ &+ \int \Big( \sum_y c_{y,\delta^*(x_{A \cup S})}\, p(y \mid x_A, x_S) \Big) p(x_A \mid x_S)\, 1_{x_2 < m x_1 + r}(x_A)\, dx_A \\ ={}& \int c_{0,1}\, p(y=0 \mid x_A, x_S)\, p(x_A \mid x_S)\, 1_{x_2 \geq m x_1 + r}(x_A)\, dx_A \\ &+ \int c_{1,0}\, p(y=1 \mid x_A, x_S)\, p(x_A \mid x_S)\, 1_{x_2 < m x_1 + r}(x_A)\, dx_A = 0, \end{aligned}$$

where the last equality holds because the labels are deterministic given x₁ and x₂: p(y=0|x_(A), x_(S))=0 on the region where d* predicts 1, and p(y=1|x_(A), x_(S))=0 on the region where d* predicts 0.

Next, let A={x₁}, then we have

$$\lbrack \text{Math. } 8 \rbrack \quad p(y=1 \mid x_1, x_S) = p(x_2 \geq m x_1 + r \mid x_1, x_S) = \int_{m x_1 + r}^{\infty} N(x_2 \mid 0, 1)\, dx_2, \qquad \delta^*(x_{A \cup S}) = \begin{cases} 1 & \text{if } \int_{m x_1 + r}^{\infty} N(x_2 \mid 0, 1)\, dx_2 \geq 0.5, \\ 0 & \text{else.} \end{cases}$$

Define b as the value of x₁ for which

∫_(mx₁+r)^(∞) N(x₂|0, 1)dx₂ = 0.5.   [Math. 9]

Since

∫₀^(∞) N(x₂|0, 1)dx₂ = 0.5,   [Math. 10]

we have b = −r/m. Then we have

$$\lbrack \text{Math. } 11 \rbrack \quad \begin{aligned} \alpha_{\{x_1\}} :={}& \mathbb{E}_{x_A}\!\left[ \mathrm{BayesRisk}(x_{A \cup S}) \mid x_S \right] = \int \Big( \sum_y c_{y,\delta^*(x_1, x_S)}\, p(y \mid x_1, x_S) \Big) p(x_1 \mid x_S)\, dx_1 \\ ={}& \int_{-\infty}^{b} \Big( \sum_y c_{y,1}\, p(y \mid x_1, x_S) \Big) p(x_1 \mid x_S)\, dx_1 + \int_{b}^{\infty} \Big( \sum_y c_{y,0}\, p(y \mid x_1, x_S) \Big) p(x_1 \mid x_S)\, dx_1 \\ ={}& \int_{-\infty}^{b} c_{0,1}\, p(y=0 \mid x_1, x_S)\, p(x_1 \mid x_S)\, dx_1 + \int_{b}^{\infty} c_{1,0}\, p(y=1 \mid x_1, x_S)\, p(x_1 \mid x_S)\, dx_1. \end{aligned}$$

Analogously, we can calculate the expected Bayes risk [alpha]_({x2}) for A={x₂}.

Finally, let A=∅. Let us define the random variable z := x₂ − mx₁ − r. Since x₁ and x₂ are independent standard normal variables, we have z ~ N(−r, m²+1). We therefore have

$$\lbrack \text{Math. } 12 \rbrack \quad p(y=1 \mid x_S) = p(x_2 \geq m x_1 + r \mid x_S) = p(x_2 - m x_1 - r \geq 0 \mid x_S) = p(z \geq 0) = \int_0^{\infty} N(z \mid -r,\, m^2 + 1)\, dz < 0.5,$$

since we assume r>0. Therefore, we have d*(x_(S))=0. As a consequence, we have

$$\lbrack \text{Math. } 13 \rbrack \quad \alpha_{\emptyset} := \mathbb{E}_{x_A}\!\left[ \mathrm{BayesRisk}(x_{A \cup S}) \mid x_S \right] = \sum_y c_{y,\delta^*(x_S)}\, p(y \mid x_S) = \sum_y c_{y,0}\, p(y \mid x_S) = c_{1,0}\, p(y=1 \mid x_S).$$

Without loss of generality, let us assume that [alpha]_({x1})<[alpha]_({x2}), and that the cost of each covariate is f>0. The greedy algorithm with |A|<=1 fails if both (I) t(∅)<t({x₁}) and (II) t(∅)>t({x₁, x₂}) hold. These conditions mean (I) [alpha]_∅<[alpha]_({x1})+f and (II) [alpha]_∅>2f. A cost f>0 satisfying both exists if and only if [alpha]_({x1})>[alpha]_∅/2.

Therefore, except for the case r=0, there is always a covariate cost f>0 for which the greedy algorithm will fail. As a concrete numerical example, let us assume that r=m=1, c_(0,1)=c_(1,0)=100, and f=10. The total expected costs for each query set are listed in Table 1.

TABLE 1

  A           t(A)
  ∅           24.0
  {x₁}        28.2
  {x₂}        28.2
  {x₁, x₂}    20.0
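The entries of Table 1 can be reproduced by one-dimensional quadrature. The following sketch (assuming NumPy and SciPy are available; variable names are illustrative) evaluates [alpha]_∅ and [alpha]_({x1}) for r=m=1, c=100, and f=10:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

m, r, c, f = 1.0, 1.0, 100.0, 10.0

# alpha_empty = c * p(y=1 | x_S), where z = x2 - m*x1 - r ~ N(-r, m^2 + 1).
alpha_empty = c * (1.0 - norm.cdf(0.0, loc=-r, scale=np.sqrt(m**2 + 1.0)))

# alpha_{x1}: p(y=1 | x1) = P(x2 >= m*x1 + r) = 1 - Phi(m*x1 + r), b = -r/m.
b = -r / m
p1 = lambda x1: 1.0 - norm.cdf(m * x1 + r)
miss0 = quad(lambda x1: c * (1.0 - p1(x1)) * norm.pdf(x1), -np.inf, b)[0]
miss1 = quad(lambda x1: c * p1(x1) * norm.pdf(x1), b, np.inf)[0]
alpha_x1 = miss0 + miss1          # equals alpha_{x2} by symmetry for m = 1

print(f"t(empty)   = {alpha_empty:.1f}")    # ~24.0
print(f"t({{x1}})    = {alpha_x1 + f:.1f}")  # ~28.2
print(f"t({{x1,x2}}) = {2 * f:.1f}")         # 20.0, since alpha_{x1,x2} = 0
```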

It is an exemplary object of the present invention to provide an empirical risk estimation system, an empirical risk estimation method, and an empirical risk estimation program which, even when the number of query covariates is more than one, can estimate an empirical risk with high accuracy at low computational costs.

Solution to Problem

An empirical risk estimation system according to the present invention includes: a density estimation unit that estimates a conditional probability density of a random variable, denoting the real value that is the result of a smooth function map of given unobserved covariates, by training a regression model with the response corresponding to the random variable and the regressors corresponding to the observed covariates; and an integral estimation unit that estimates the one-dimensional integral of the product of a sigmoidal function with the input random variable and the conditional probability density function of the random variable.

An empirical risk estimation method according to the present invention includes: estimating a conditional probability density of a random variable, denoting the real value that is the result of a smooth function map of given unobserved covariates, by training a regression model with the response corresponding to the random variable and the regressors corresponding to the observed covariates; and estimating the one-dimensional integral of the product of a sigmoidal function with the input random variable and the conditional probability density function of the random variable.

An empirical risk estimation program according to the present invention causes a computer to perform: a density estimation process of estimating a conditional probability density of a random variable, denoting the real value that is the result of a smooth function map of given unobserved covariates, by training a regression model with the response corresponding to the random variable and the regressors corresponding to the observed covariates; and an integral estimation process of estimating the one-dimensional integral of the product of a sigmoidal function with the input random variable and the conditional probability density function of the random variable.

Advantageous Effects of Invention

According to the present invention, even when the number of query covariates is more than one, it is possible to estimate an empirical risk with high accuracy at low computational costs.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 It depicts an exemplary block diagram illustrating the structure of an exemplary embodiment of an empirical risk estimation system according to the present invention.

FIG. 2 It depicts an exemplary explanatory diagram illustrating the structure of an exemplary embodiment of the empirical risk estimation system according to the present invention.

FIG. 3 It depicts an exemplary explanatory diagram illustrating different Sigmoid function approximations.

FIG. 4 It depicts a flowchart illustrating an operation example of the empirical risk estimation system in this exemplary embodiment.

FIG. 5 It depicts a block diagram illustrating an outline of an empirical risk estimation system according to the present invention.

FIG. 6 It depicts a schematic block diagram illustrating the configuration example of the computer according to the exemplary embodiment of the present invention.

FIG. 7 It depicts an explanatory diagram illustrating an example of a decision boundary between classes.

DESCRIPTION OF EMBODIMENTS

The following describes an exemplary embodiment of the present inventionwith reference to drawings.

FIG. 1 is an exemplary block diagram illustrating the structure of an exemplary embodiment of an empirical risk estimation system according to the present invention. FIG. 2 is an exemplary explanatory diagram illustrating the structure of an exemplary embodiment of the empirical risk estimation system according to the present invention.

In the present exemplary embodiment, it is assumed that the conditional class probability can be expressed as the following generalized additive model:

p(y|x_(A), x_(S)) = g(f_(A)(x_(A)) + f_(S)(x_(S)) + τ),   [Math. 14]

where g is a sigmoid function, e.g. the logistic function, the Greek letter tau (hereinafter [tau]) is the bias, and f_(A): R^(|A|)->R and f_(S): R^(|S|)->R are arbitrary smooth functions. The method of learning [tau] and the functions is arbitrary; commonly, for example, they are learned from labeled data with gradient descent, and the method described in NPL 2 may be used for this learning. In the present exemplary embodiment, [tau] and the functions are assumed to be given.

For example, in the case of a classifier with a linear decision boundary, we have

p(y|x _(A) , x _(S))=g(β^(T) x+τ)=g(β_(A) ^(T) x _(A)+β_(S) ^(T) x_(S)+τ)   [Math. 15]

where the Greek letter beta (hereinafter [beta]) is the weight vector of the classifier that was learned from labeled data. We remark that [beta]_(A) and [beta]_(S) denote the sub-vectors of [beta] corresponding to the covariates A and S, respectively.

The expected misclassification costs can be expressed as follows.

$$\lbrack \text{Math. } 16 \rbrack \quad \begin{aligned} \mathbb{E}_{x_A}\!\left[ \mathrm{BayesRisk}(x_{A \cup S}) \mid x_S \right] &= \int \sum_y c_{y,\delta^*(x_{A \cup S})}\, p(y, x_A \mid x_S)\, dx_A \\ &= \int \sum_y c_{y,\delta^*(x_{A \cup S})}\, p(y \mid x_A, x_S)\, p(x_A \mid x_S)\, dx_A \\ &= \int \sum_y c_{y,\delta^*(f_A(x_A),\, f_S(x_S))}\, p(y \mid f_A(x_A), f_S(x_S))\, p(x_A \mid x_S)\, dx_A \\ &= \int \sum_y c_{y,\delta^*(z,\, f_S(x_S))}\, p(y \mid z, f_S(x_S))\, h(z)\, dz, \end{aligned} \quad (\text{Equation 3})$$

where we introduced the random variable z := f_(A)(x_(A)), with density h(z) := p(z|x_(S)). The resulting integral in Equation 3 is only a one-dimensional integral in z. However, it requires us to estimate h(z).

The empirical risk estimation system 100 according to the present exemplary embodiment includes a density estimation unit 10, an integral estimation unit 20, and a storage unit 30.

The density estimation unit 10 estimates h(z). In particular, the density estimation unit 10 is given the observed covariates S and estimates a conditional probability density of a random variable z by training a regression model with the response corresponding to z and the regressors corresponding to the covariates S. The variable z denotes the real value that is the result of a smooth function map of the unobserved covariates A.

In the following, it is explained how h(z) can be estimated using linear or non-linear regression. Let us denote by {x^((i))}^(n)_(i=1) the collection of unlabeled data. Note that the density estimation unit 10 does not require class-labeled data. From the collection of unlabeled data, the density estimation unit 10 may form a collection of response and explanatory variable pairs of the form {(z^((i)), x_(S)^((i)))}^(n)_(i=1), where z^((i)) = f_(A)(x_(A)^((i))). For example, if a linear relationship between z and x_(S) with normal noise is assumed, then the density estimation unit 10 has

p(z|x _(S))=N(γ^(T) x _(S), σ²), [Math. 17]

for some parameter vector γ ∈ R^(|S|) and σ² ∈ R_(>0),   [Math. 18]

which can be estimated from the data {(z^((i)), x_(S)^((i)))}^(n)_(i=1). Note that the Greek letter mu is denoted as [mu], the upper-case Greek letter Sigma is denoted as [Sigma], and the lower-case Greek letter sigma is denoted as [sigma]. For example, if the joint distribution p(x) is a multivariate normal distribution N([mu], [Sigma]), and p(y|x_(A), x_(S)) follows a logistic regression model with weight vector [beta], then the maximum likelihood estimate leads to

p(z|x _(S))=N(β_(A) ^(T)μ_(A|S), β_(A) ^(T)Σ_(A|S)β_(A)),   [Math. 19]

with

μ_(A|S)=μ_(A)+Σ_(A,S)Σ_(S,S) ⁻¹(x _(S)−μ_(S)),

Σ_(A|S)=Σ_(A,A)−Σ_(A,S)Σ_(S,S) ⁻¹Σ_(S,A).

That is, the density estimation unit 10 may estimate the conditionalprobability density of z by the normal distribution.
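As a concrete sketch of this closed-form estimate (Math 19), assuming NumPy; the function and argument names are illustrative, not part of the embodiment:

```python
import numpy as np

def normal_h_of_z(mu, Sigma, beta_A, x_S, idx_A, idx_S):
    """Mean and variance of h(z) = p(beta_A^T x_A | x_S) under a joint
    multivariate normal p(x) = N(mu, Sigma), cf. Math 19.

    idx_A, idx_S: integer index arrays selecting covariates A and S.
    """
    mu_A, mu_S = mu[idx_A], mu[idx_S]
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AS = Sigma[np.ix_(idx_A, idx_S)]
    S_SS = Sigma[np.ix_(idx_S, idx_S)]

    # Conditional mean and covariance of x_A given x_S (standard formulas).
    mu_A_S = mu_A + S_AS @ np.linalg.solve(S_SS, x_S - mu_S)
    Sigma_A_S = S_AA - S_AS @ np.linalg.solve(S_SS, S_AS.T)

    return beta_A @ mu_A_S, beta_A @ Sigma_A_S @ beta_A
```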

If a linear relationship between z and x_(S) is unreasonable, then a non-parametric regression model like Gaussian processes might be more appropriate. As before, let x^((i)) (x^((i)) belongs to R^(p)) be the i-th sample of x available at training time, and let x*_(S) be the observed covariates of a new sample at test time. Then the matrix K(X_(S), X_(S)) is defined as follows:

K(X _(S) , X _(S))_(ij) =k(x _(S) ^((i)) , x _(S) ^((j))),   [Math. 20]

where k is a covariance function. For example, using the squared exponential covariance function, the density estimation unit 10 has

$$\lbrack \text{Math. } 21 \rbrack \quad k( x_{S}^{(i)}, x_{S}^{(j)} ) = e^{- \frac{\| x_{S}^{(i)} - x_{S}^{(j)} \|_{2}^{2}}{l^{2}}},$$

where l is the length-scale parameter. Furthermore, the density estimation unit 10 defines the column vector z (z belongs to R^(n)) as

z _(i) =f _(A)(x _(A) ^((i))).   [Math. 22]

And for a new sample x* at test time, the density estimation unit 10 defines, analogously,

z*=f _(A)(x* _(A))   [Math. 23]

Finally, the density estimation unit 10 defines the column vector k(x*_(S), X_(S)) (k(x*_(S), X_(S)) belongs to R^(n)) as follows:

k(x* _(S) , X _(S))_(i) =k(x* _(S) , x _(S) ^((i))).   [Math. 24]

Then, under the Gaussian process assumption with additional Gaussian noise with variance [sigma]₀², the density estimation unit 10 has

$$\lbrack \text{Math. } 25 \rbrack \quad \begin{pmatrix} z \\ z^{*} \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_{0} 1_{n} \\ \mu_{0} \end{pmatrix}, \begin{pmatrix} K(X_{S}, X_{S}) + \sigma_{0}^{2} I_{n} & k(x_{S}^{*}, X_{S}) \\ k(x_{S}^{*}, X_{S})^{T} & k(x_{S}^{*}, x_{S}^{*}) \end{pmatrix} \right),$$

where the density estimation unit 10 assumes a fixed mean [mu]₀ given by

$$\lbrack \text{Math. } 26 \rbrack \quad \mu_{0} = \frac{1}{n} \sum_{i=1}^{n} z_{i},$$

and 1_(n) (1_(n) belongs to R^(n)) is the all-ones vector. As a consequence, the density estimation unit 10 has

h(z)=N(μ, σ²),   [Math. 27]

with

μ = μ₀ + k(x*_(S), X_(S))^(T)(K(X_(S), X_(S)) + σ₀²I_(n))⁻¹(z − 1_(n)μ₀),

σ² = k(x*_(S), x*_(S)) − k(x*_(S), X_(S))^(T)(K(X_(S), X_(S)) + σ₀²I_(n))⁻¹k(x*_(S), X_(S)).
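A NumPy sketch of this Gaussian-process estimate of h(z) (Math 25 to Math 27) might look as follows; the kernel hyperparameters and all names are illustrative assumptions:

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared exponential covariance k(a, b) = exp(-||a - b||_2^2 / l^2)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / length_scale**2)

def gp_h_of_z(X_S, z, x_star_S, sigma0_sq=0.1, length_scale=1.0):
    """Mean and variance of h(z) = p(z* | x*_S), cf. Math 27.

    X_S: (n, |S|) training covariates; z: (n,) values z_i = f_A(x_A^(i));
    x_star_S: (|S|,) observed covariates of the new sample.
    """
    n = X_S.shape[0]
    mu0 = z.mean()                                        # fixed mean, Math 26
    K = sq_exp_kernel(X_S, X_S, length_scale) + sigma0_sq * np.eye(n)
    k_star = sq_exp_kernel(x_star_S[None, :], X_S, length_scale)[0]
    mean = mu0 + k_star @ np.linalg.solve(K, z - mu0)
    var = 1.0 - k_star @ np.linalg.solve(K, k_star)       # k(x*, x*) = 1 here
    return mean, var
```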

The integral estimation unit 20 estimates Equation 3. In particular, the integral estimation unit 20 estimates the one-dimensional integral of the product of a sigmoidal function g with input z and the conditional probability density function of z.

The integral estimation unit 20 may simply use Monte Carlo samples from h(z) in order to estimate Equation 3. On the other hand, in order to improve the processing speed, the integral estimation unit 20 may use a different strategy based on a piece-wise linear approximation of the sigmoid function g, as explained in the following.
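A baseline Monte Carlo estimate of Equation 3 for a normal h(z) could be sketched as follows (assuming NumPy; per-sample decisions follow Equation 2; names are illustrative):

```python
import numpy as np

def mc_expected_risk(mean_z, var_z, f_S_val, tau, c01, c10,
                     n_samples=100_000, seed=0):
    """Monte Carlo estimate of Equation 3 with h(z) = N(mean_z, var_z)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mean_z, np.sqrt(var_z), size=n_samples)
    p1 = 1.0 / (1.0 + np.exp(-(z + f_S_val + tau)))  # g(z + f_S(x_S) + tau)
    # delta* = 1 where predicting 1 has the lower expected cost (Equation 2).
    predict_one = p1 * c10 >= (1.0 - p1) * c01
    # Expected cost per sample: c10*p1 if delta*=0, c01*(1-p1) if delta*=1.
    cost = np.where(predict_one, c01 * (1.0 - p1), c10 * p1)
    return cost.mean()
```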

First, the integral estimation unit 20 expresses the expected misclassification cost as follows:

$$\lbrack \text{Math. } 28 \rbrack \quad \begin{aligned} \mathbb{E}_{x_A}\!\left[ \mathrm{BayesRisk}(x_{A \cup S}) \mid x_S \right] &= \mathbb{E}_{x_A}\!\Big[ \sum_y c_{y,\delta^*(x_{A \cup S})}\, p(y \mid x_{A \cup S}) \,\Big|\, x_S \Big] \\ &= \mathbb{E}_{x_A}\!\left[ c_{0,\delta^*(x_{A \cup S})}\, p(y=0 \mid x_{A \cup S}) + c_{1,\delta^*(x_{A \cup S})}\, p(y=1 \mid x_{A \cup S}) \mid x_S \right] \\ &= \mathbb{E}_{x_A}\!\left[ c_{0,\delta^*(x_{A \cup S})}\, p(y=0 \mid x_{A \cup S}) \mid x_S \right] + \mathbb{E}_{x_A}\!\left[ c_{1,\delta^*(x_{A \cup S})}\, p(y=1 \mid x_{A \cup S}) \mid x_S \right]. \end{aligned}$$

Next, note that

δ*(x _(A∪S))=argmin[p(y=1|x _(A∪S))·c _(1,0) , p(y=0|x _(A∪S))·c_(0,1)].   [Math. 29]

Furthermore, the integral estimation unit 20 has

$$\lbrack \text{Math. } 30 \rbrack \quad \begin{aligned} \delta^*(x_{A \cup S}) = 1 &\Leftrightarrow \frac{p(y=1 \mid x_{A \cup S}) \cdot c_{1,0}}{p(y=0 \mid x_{A \cup S}) \cdot c_{0,1}} \geq 1 \\ &\Leftrightarrow \frac{g(f_A(x_A) + f_S(x_S) + \tau) \cdot c_{1,0}}{(1 - g(f_A(x_A) + f_S(x_S) + \tau)) \cdot c_{0,1}} \geq 1 \\ &\Leftrightarrow e^{f_A(x_A) + f_S(x_S) + \tau} \geq \frac{c_{0,1}}{c_{1,0}} \\ &\Leftrightarrow f_A(x_A) \geq \log\!\Big( \frac{c_{0,1}}{c_{1,0}} \Big) - \tau - f_S(x_S) \\ &\Leftrightarrow z \geq \zeta, \end{aligned} \qquad \text{where } z := f_A(x_A) \text{ and } \zeta := \log\!\Big( \frac{c_{0,1}}{c_{1,0}} \Big) - \tau - f_S(x_S).$$

As described above, d*(x_(A U S)) depends only on z (a random variable) and [zeta] (fixed). Thus, the integral estimation unit 20 has

$$\lbrack \text{Math. } 31 \rbrack \quad \begin{aligned} \mathbb{E}_{x_A}\!\left[ c_{1,\delta^*(x_{A \cup S})}\, p(y=1 \mid x_{A \cup S}) \mid x_S \right] &= \int c_{1,\delta^*(z,\zeta)}\, g(z + f_S(x_S) + \tau)\, h(z)\, dz \\ &= \int_{-\infty}^{\zeta} c_{1,0}\, g(z + f_S(x_S) + \tau)\, h(z)\, dz + \int_{\zeta}^{\infty} c_{1,1}\, g(z + f_S(x_S) + \tau)\, h(z)\, dz \\ &= c_{1,0} \int_{-\infty}^{\zeta} g(z + f_S(x_S) + \tau)\, h(z)\, dz. \end{aligned}$$

And, analogously, the integral estimation unit 20 has

$$\lbrack \text{Math. } 32 \rbrack \quad \begin{aligned} \mathbb{E}_{x_A}\!\left[ c_{0,\delta^*(x_{A \cup S})}\, p(y=0 \mid x_{A \cup S}) \mid x_S \right] &= c_{0,0} \int_{-\infty}^{\zeta} (1 - g(z + f_S(x_S) + \tau))\, h(z)\, dz + c_{0,1} \int_{\zeta}^{\infty} (1 - g(z + f_S(x_S) + \tau))\, h(z)\, dz \\ &= c_{0,1} \int_{\zeta}^{\infty} (1 - g(z + f_S(x_S) + \tau))\, h(z)\, dz \\ &= c_{0,1} \int_{\zeta}^{\infty} h(z)\, dz - c_{0,1} \int_{\zeta}^{\infty} g(z + f_S(x_S) + \tau)\, h(z)\, dz. \end{aligned}$$

Thus the remaining task is to evaluate the following integral

$$\lbrack \text{Math. } 33 \rbrack \quad \int_{a'}^{b'} g(z + f_S(x_S) + \tau)\, h(z)\, dz = \int_{a' + f_S(x_S) + \tau}^{b' + f_S(x_S) + \tau} g(u)\, h(u - f_S(x_S) - \tau)\, du. \quad (\text{Equation 4})$$

One popular strategy is to approximate the sigmoid function g by the cumulative distribution function of the standard normal distribution [Phi]. However, it turns out that this approximation does not work here, since a or b is finite in our case. Instead, in the present exemplary embodiment, the integral estimation unit 20 uses the fact that the sigmoid function can be well approximated with only a small number of linear functions. It is assumed that h(z) is a normal distribution with mean [mu]′ and variance [sigma]². In order to facilitate notation, the following constants are introduced:

a := a′ + f_(S)(x_(S)) + τ,

b := b′ + f_(S)(x_(S)) + τ,

μ := μ′ + f_(S)(x_(S)) + τ.   [Math. 34]

Then the integral in Equation 4 can be written as

$$\lbrack \text{Math. } 35 \rbrack \quad \int_a^b g(u)\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(u-\mu)^2}\, du. \quad (\text{Equation 5})$$

The integral estimation unit 20 defines the following piece-wise linear approximation of the sigmoid function:

$$\lbrack \text{Math. } 36 \rbrack \quad g(u) \approx \sum_{t=1}^{\xi+2} 1_{[b_{t-1},\, b_t]}(u)\, (m_t u + v_t), \quad \text{where for } 1 \leq t \leq \xi+1:\ b_t := -10 + \frac{20}{\xi}(t-1); \quad \text{for } 1 \leq t \leq \xi:\ m_{t+1} := \frac{g(b_{t+1}) - g(b_t)}{b_{t+1} - b_t}, \quad v_{t+1} := g(b_t) - m_{t+1} b_t; \quad \text{and } b_0 := -\infty,\ m_1 := 0,\ v_1 := g(b_1),\ b_{\xi+2} := +\infty,\ m_{\xi+2} := 0,\ v_{\xi+2} := g(b_{\xi+1}),$$

and [xi] is the number of linear approximations, which is, for example,set to 40. A comparison with the approximation

$$\Phi\!\left( \sqrt{\tfrac{\pi}{8}}\, u \right) \quad \lbrack \text{Math. } 37 \rbrack$$

is shown in FIG. 3. FIG. 3 depicts an exemplary explanatory diagram illustrating different Sigmoid function approximations. In FIG. 3, a line 41 represents a Sigmoid, a line 42 represents a linear approximation, a line 43 represents a normal CDF (cumulative distribution function) approximation, and a line 44 represents a discrete approximation. Following NPL 1, [xi]=40 is set for the linear function approximation and the discrete bin approximation. For the normal CDF approximation,

$$\Phi\!\left( \sqrt{\tfrac{\pi}{8}}\, u \right) \quad \lbrack \text{Math. } 38 \rbrack$$

is used.

This shows that, for a relatively small number of linear segments, the integral estimation unit 20 can achieve an approximation that is more accurate than the [Phi]-approximation. More importantly, as shown below, this allows for a tractable calculation of the integral in Equation 5, which is not the case when using the [Phi]-approximation.

Then the integral estimation unit 20 has

$$\lbrack \text{Math. } 39 \rbrack \quad \begin{aligned} \int_a^b g(u) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(u-\mu)^2}\, du &= \int_a^b \sum_{t=1}^{\xi+2} 1_{[b_{t-1},\, b_t]}(u)\, (m_t u + v_t) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(u-\mu)^2}\, du \\ &= \sum_{t=1}^{\xi+2} \int_{\max(a,\, b_{t-1})}^{\min(b,\, b_t)} (m_t u + v_t) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(u-\mu)^2}\, du \\ &= \sum_{t=1}^{\xi+2} \left( m_t \int_{\max(a,\, b_{t-1})}^{\min(b,\, b_t)} u \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(u-\mu)^2}\, du + v_t\, \Phi_{\max(a,\, b_{t-1})}^{\min(b,\, b_t)} \right), \end{aligned} \qquad \text{where } \Phi_l^o := \int_l^o \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(u-\mu)^2}\, du,$$

which can be evaluated accurately with standard implementations of the normal CDF. The remaining integral can also be expressed by [Phi]: using the substitution r := u − [mu], the integral estimation unit 20 has

$$\lbrack \text{Math. } 40 \rbrack \quad \begin{aligned} \int_l^o u \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(u-\mu)^2}\, du &= \int_{l-\mu}^{o-\mu} (r + \mu) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}r^2}\, dr \\ &= \int_{l-\mu}^{o-\mu} r \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}r^2}\, dr + \mu \Phi_l^o \\ &= \frac{1}{\sqrt{2\pi\sigma^2}} \left[ -\sigma^2 e^{-\frac{1}{2\sigma^2}r^2} \right]_{l-\mu}^{o-\mu} + \mu \Phi_l^o \\ &= \frac{\sigma}{\sqrt{2\pi}} \left( e^{-\frac{1}{2\sigma^2}(l-\mu)^2} - e^{-\frac{1}{2\sigma^2}(o-\mu)^2} \right) + \mu \Phi_l^o. \end{aligned}$$

In this way, the integral estimation unit 20 may estimate the one-dimensional integral by using a piece-wise linear approximation of the sigmoid function.
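The construction of Math 36 and the closed forms of Math 39 and Math 40 can be combined into a short routine. The following sketch assumes NumPy/SciPy are available; all function names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def piecewise_coeffs(xi=40):
    """Knots b_t and coefficients (m_t, v_t) of Math 36 on [-10, 10]."""
    b = np.linspace(-10.0, 10.0, xi + 1)
    m_mid = (sigmoid(b[1:]) - sigmoid(b[:-1])) / (b[1:] - b[:-1])
    v_mid = sigmoid(b[:-1]) - m_mid * b[:-1]
    knots = np.concatenate(([-np.inf], b, [np.inf]))
    m = np.concatenate(([0.0], m_mid, [0.0]))
    v = np.concatenate(([sigmoid(b[0])], v_mid, [sigmoid(b[-1])]))
    return knots, m, v

def integral_g_times_normal(a, b, mu, sigma, xi=40):
    """Approximate int_a^b g(u) N(u | mu, sigma^2) du via Math 39 / Math 40."""
    knots, m, v = piecewise_coeffs(xi)
    total = 0.0
    for t in range(len(m)):
        lo, hi = max(a, knots[t]), min(b, knots[t + 1])
        if lo >= hi:
            continue
        Phi = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)  # Phi_l^o
        # Closed form of int_lo^hi u N(u | mu, sigma^2) du, cf. Math 40.
        lin = sigma / np.sqrt(2.0 * np.pi) * (
            np.exp(-0.5 * ((lo - mu) / sigma) ** 2)
            - np.exp(-0.5 * ((hi - mu) / sigma) ** 2)) + mu * Phi
        total += m[t] * lin + v[t] * Phi
    return total
```

For instance, the term c_(1,0)∫_(−∞)^(ζ) g(z+f_(S)(x_(S))+τ)h(z)dz of Math 31 becomes, after the substitution of Equation 4, c10 * integral_g_times_normal(-np.inf, zeta + shift, mu_prime + shift, sigma) with shift = f_(S)(x_(S)) + τ and [mu]′ the mean of h(z).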

The storage unit 30 stores various data. The storage unit 30 may store the unlabeled data {x}. The storage unit 30 is realized by a magnetic disk or the like.

The density estimation unit 10 and the integral estimation unit 20 are each implemented by a CPU of a computer that operates in accordance with a program (empirical risk estimation program). For example, the program may be stored in the storage unit 30 included in the empirical risk estimation system 100, and the CPU may read the program and operate as the density estimation unit 10 and the integral estimation unit 20 in accordance with the program.

In the empirical risk estimation system of the present exemplary embodiment, the density estimation unit 10 and the integral estimation unit 20 may each be implemented by dedicated hardware. Further, the empirical risk estimation system according to the present invention may be configured with two or more physically separate devices which are connected in a wired or wireless manner.

The following describes an operation example of the empirical risk estimation system in this exemplary embodiment. FIG. 4 is a flowchart illustrating an operation example of the empirical risk estimation system in this exemplary embodiment.

The density estimation unit 10 receives as input a partially observed data sample x_(S), the index set of unknown covariates A, and the unlabeled data {x} (Step S101). The density estimation unit 10 estimates the conditional probability p(x_(A)|x_(S)) (Step S102). The density estimation unit 10 approximates the probability p(x_(A)^(T)[beta]_(A)|x_(S)) by a normal distribution h(z) (Step S103).

The integral estimation unit 20 calculates the threshold z* such that if z>z* then d*(x_(S U A))=1, else d*(x_(S U A))=0 (Step S104). The integral estimation unit 20 performs a piece-wise linear approximation of g and expresses the following integrals in terms of the Gaussian CDF (Step S105):

∫_(z*)^(∞) g(z + β_(S)^(T)x_(S) + τ)h(z)dz,

∫_(−∞)^(z*) g(z + β_(S)^(T)x_(S) + τ)h(z)dz.   [Math. 41]

The integral estimation unit 20 evaluates E_(xA)[BayesRisk(x_(A U S))|x_(S)] (Step S106). In this way, the Bayes risk to be expected when the covariates A are acquired is estimated.
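Putting Steps S101 to S106 together for the linear-classifier case, an end-to-end sketch could look as follows (assuming NumPy/SciPy and the illustrative integral_g_times_normal routine from the sketch above; all names and the OLS noise estimate are assumptions, not the definitive implementation):

```python
import numpy as np
from scipy.stats import norm

def expected_bayes_risk_linear(X_unlabeled, idx_A, idx_S, x_S, beta, tau,
                               c01, c10):
    """Steps S101-S106 for p(y=1 | x) = g(beta^T x + tau)."""
    beta_A, beta_S = beta[idx_A], beta[idx_S]
    # S102/S103: regress z^(i) = beta_A^T x_A^(i) on x_S^(i), normal noise.
    Z = X_unlabeled[:, idx_A] @ beta_A
    F = np.column_stack([X_unlabeled[:, idx_S], np.ones(len(Z))])  # + bias
    gamma, *_ = np.linalg.lstsq(F, Z, rcond=None)
    mu_z = gamma @ np.append(x_S, 1.0)
    sigma_z = (Z - F @ gamma).std()
    # S104: threshold zeta (Math 30); delta* = 1 iff z >= zeta.
    f_S = beta_S @ x_S
    zeta = np.log(c01 / c10) - tau - f_S
    # S105/S106: the two one-dimensional integrals of Math 31 and Math 32.
    shift = f_S + tau
    low = integral_g_times_normal(-np.inf, zeta + shift, mu_z + shift, sigma_z)
    up = integral_g_times_normal(zeta + shift, np.inf, mu_z + shift, sigma_z)
    tail = 1.0 - norm.cdf(zeta, mu_z, sigma_z)   # int_zeta^inf h(z) dz
    return c10 * low + c01 * (tail - up)
```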

In this manner, in the present exemplary embodiment, the density estimation unit 10 estimates a conditional probability density of z by training a regression model with the response corresponding to z and the regressors corresponding to the observed covariates S. Then the integral estimation unit 20 estimates the one-dimensional integral of the product of a sigmoidal function g with input z and the conditional probability density function of z.

With the above structure, even when the number of query covariates is more than one, it is possible to estimate an empirical risk with high accuracy at low computational costs.

That is, in the present exemplary embodiment, a classifier whose class probability is an additive function of the feature maps of the query covariates is considered, and the value of the sum of those feature maps is a real-valued number. This real-valued number is considered as a random variable for which we directly estimate the conditional distribution given the already observed covariates. Then the integral estimation unit 20 estimates the expected misclassification costs with respect to this conditional distribution.

In this case, in the present exemplary embodiment, even when the number of query covariates is more than one, it is only necessary to solve a one-dimensional integral in order to estimate the expected misclassification costs. Therefore, in contrast to high-dimensional integrals, the one-dimensional integral can be solved with numeric methods with high accuracy at low computational costs.

Next, an outline of the present invention will be described. FIG. 5 is a block diagram illustrating an outline of the empirical risk estimation system according to the present invention. The empirical risk estimation system 80 (for example, empirical risk estimation system 100) according to the present invention includes: a density estimation unit 81 (for example, the density estimation unit 10) that is given observed covariates (for example, S) and estimates a conditional probability density of a random variable (for example, z), denoting the real value that is the result of a smooth function map of the unobserved covariates (for example, A), by training a regression model with the response corresponding to the random variable (for example, z) and the regressors corresponding to the observed covariates (for example, S); and an integral estimation unit 82 (for example, the integral estimation unit 20) that estimates the one-dimensional integral of the product of a sigmoidal function (for example, g) with the input random variable (for example, z) and the conditional probability density function of the random variable (for example, z).

With such a configuration, even when the number of query covariates is more than one, it is possible to estimate an empirical risk with high accuracy at low computational costs.

In addition, the density estimation unit 81 may estimate the conditional probability density of the random variable (for example, z) by a normal distribution, and the integral estimation unit 82 may estimate the one-dimensional integral by using a piece-wise linear approximation of the sigmoid function. With such a configuration, it is possible to improve the processing speed.

Next, a configuration example of a computer according to the exemplary embodiment of the present invention will be described. FIG. 6 is a schematic block diagram illustrating the configuration example of the computer according to the exemplary embodiment of the present invention. The computer 1000 includes a CPU 1001, a main memory 1002, an auxiliary storage device 1003, an interface 1004, and a display device 1005.

The empirical risk estimation system 100 described above may be installed on the computer 1000. In such a configuration, the operation of the system may be stored in the auxiliary storage device 1003 in the form of a program. The CPU 1001 reads the program from the auxiliary storage device 1003, loads the program into the main memory 1002, and performs a predetermined process in the exemplary embodiment according to the program.

The auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of the non-transitory tangible medium include a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like connected through the interface 1004. Furthermore, when this program is distributed to the computer 1000 through a communication line, the computer 1000 receiving the distributed program may load the program into the main memory 1002 to perform the predetermined process in the exemplary embodiment.

Furthermore, the program may partially achieve the predetermined process in the exemplary embodiment. Furthermore, the program may be a difference program combined with another program already stored in the auxiliary storage device 1003 to achieve the predetermined process in the exemplary embodiment.

Furthermore, depending on the content of a process according to an exemplary embodiment, some of the elements of the computer 1000 can be omitted. For example, when information is not presented to the user, the display device 1005 can be omitted. Although not illustrated in FIG. 6, depending on the content of a process according to an exemplary embodiment, the computer 1000 may include an input device. For example, the empirical risk estimation system 100 may include an input device for inputting an instruction to move to a link, such as clicking a portion where a link is set.

In addition, some or all of the component elements of each device are implemented by general-purpose or dedicated circuitry, a processor or the like, or a combination thereof. These may be constituted by a single chip or may be constituted by a plurality of chips connected via a bus. In addition, some or all of the component elements of each device may be achieved by a combination of the above circuitry or the like and a program.

When some or all of the component elements of each device are achieved by a plurality of information processing devices, circuitries, or the like, the plurality of information processing devices, circuitries, or the like may be arranged in a centralized or distributed manner. For example, the information processing device, circuitry, or the like may be achieved in a form in which a client and server system, a cloud computing system, and the like are each connected via a communication network.

REFERENCE SIGNS LIST

10 density estimation unit

20 integral estimation unit

30 storage unit

100 empirical risk estimation system

What is claimed is:
 1. An empirical risk estimation system comprising a hardware processor configured to execute a software code to: estimate a conditional probability density of a random variable, denoting a real value that is a result of a smooth function map of given unobserved covariates, by training a regression model with a response corresponding to the random variable, and regressors corresponding to observed covariates; and estimate a one-dimensional integral of a product of a sigmoidal function with an input random variable and a conditional probability density function of the random variable.
 2. The empirical risk estimation system according to claim 1, wherein the hardware processor is configured to execute a software code to: estimate the conditional probability density of the random variable by a normal distribution; and estimate the one-dimensional integral by using a piece-wise linear approximation of the sigmoid function.
 3. An empirical risk estimation method comprising: estimating a conditional probability density of a random variable, denoting a real value that is a result of a smooth function map of given unobserved covariates, by training a regression model with a response corresponding to the random variable, and regressors corresponding to observed covariates; and estimating a one-dimensional integral of a product of a sigmoidal function with an input random variable and a conditional probability density function of the random variable.
 4. The empirical risk estimation method according to claim 3, comprising: estimating the conditional probability density of the random variable by a normal distribution, and estimating the one-dimensional integral by using a piece-wise linear approximation of the sigmoid function.
 5. A non-transitory computer readable information recording medium storing an empirical risk estimation program which, when executed by a processor, performs a method comprising: estimating a conditional probability density of a random variable, denoting a real value that is a result of a smooth function map of given unobserved covariates, by training a regression model with a response corresponding to the random variable, and regressors corresponding to observed covariates; and estimating a one-dimensional integral of a product of a sigmoidal function with an input random variable and a conditional probability density function of the random variable.
 6. The non-transitory computer readable information recording medium according to claim 5, wherein the conditional probability density of the random variable is estimated by a normal distribution, and the one-dimensional integral is estimated by using a piece-wise linear approximation of the sigmoid function.